Since 2007, Airbnb has played a major role in travel, significantly changing the way people plan and arrange trips. As a unicorn that lets people make better use of their spare homes, Airbnb increases the value of underused property while bringing convenience to travelers. Even a legend has weaknesses, however, and today many competitors could take market share from the former giant.
Airbnb’s weaknesses mainly lie in user experience and service delivery.
From our own experience, Airbnb is weak at price comparison between its own listings, its recommendation algorithm, and its cumbersome quick selector, and it offers no features to help users tailor travel plans to their situation.
Compared with traditional hotels, Airbnb's unbranded listings make it difficult for users to compare prices effectively, especially since much of the data available from the API is not clearly surfaced in the quick-search interface, which complicates the decision-making process.
In addition, the UI problems are obvious. The current quick selector is often hard to use because of its long list and unclear icons, and Airbnb lacks comprehensive features to help travelers plan a trip around their mood, budget, and preferred types of landscape. Serving this blue-ocean market effectively could significantly improve user loyalty.
The purpose of our project is therefore to develop an Airbnb auxiliary plug-in that addresses these problems, mainly by recommending travel destinations and accommodation based on user data combined with price predictions. When a user searches, the system generates a recommended price for each listing in the results and offers suggestions. To handle the problems mentioned above, a predictive model is the first step.
At this stage, we focus on data collection and cleaning. The raw data include listings obtained from the Airbnb API (in practice, from Kaggle for this assignment), covering detailed listing information, reviews, and summary information; population data from CBS; and city data from the Amsterdam Municipality geographic information portal, all converted to a structured format suitable for analysis. We cleaned the data to eliminate inconsistencies and irrelevant information, then conducted exploratory data analysis (EDA) to reveal initial insights and patterns.
In this section, we’ll dive into the specifics of pricing and the general characteristics of Airbnb listings.
We began by importing the Airbnb dataset, which contains detailed information about listings in Amsterdam, including a wide array of variables such as price and location. Geospatial data were created for the mapping and analysis in the next step.
# import planning district and housing data
district <-
st_read("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/neighbourhoods.geojson") %>%
#dplyr::select(DIST_NAME,ABBREV) %>% #Select data for later prediction
st_transform('EPSG:7415')
nhoods_0 <-
read.csv("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/listings.csv") %>%
mutate( lat = latitude, lon = longitude)%>%
st_as_sf(coords=c("longitude","latitude"), crs=4326) %>%
st_transform('EPSG:7415')
nhoods <- st_join(nhoods_0, district)
nhoods <- transform(nhoods, price = as.numeric(price), id = as.numeric(id)) %>%
filter(price > 40 & price < 300) %>% # keep a plausible nightly-price range
st_intersection(district)
## do not run
# listing_details <- lapply(listing_details, function(x) {x[x == ""] <- NA})%>%
# as_data_frame()%>%
# na.omit()
sentiment_list<-
read.csv("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/sentiment_list.csv")
nhoods <- inner_join(nhoods,sentiment_list, by = 'id')
By plotting listing density, prices, and more on a map of Amsterdam, we can observe geographical patterns in pricing and property types: the city center has a significantly denser concentration of Airbnbs, and its average prices are also relatively higher, consistent with the typical distribution of a tourist city. The spatial analysis also reveals potential correlations between location and price, which suggests we should subsequently conduct a spatial-lag analysis to enhance the predictive power of the model.
#### (1) Loading and Filtering Data
ggplot() +
geom_sf(data = district, fill = NA)+
geom_hex(data = nhoods, aes(x = st_coordinates(nhoods)[,1], y = st_coordinates(nhoods)[,2]),binwidth = c(300,300)) +
#coord_fixed() +
labs(title = "Listing density Hexbin map in Amsterdam",
subtitle = "Based on 2019 data",
x = "longitude",
y = "latitude") +
scale_fill_viridis()+
theme_light()
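The next map colours listings by price quintile using the helpers `q5`, `qBr`, `palette5`, and `mapTheme`, which are not defined in this document. Below is a minimal base-R sketch of plausible definitions, assuming they follow the common quintile-break convention; the exact versions used for the maps may differ.

```r
# Hypothetical re-implementations -- assumptions, not the original helpers.
# q5(): bin a numeric vector into quintiles, returned as a 5-level factor.
q5 <- function(variable) {
  cut(variable,
      breaks = quantile(variable, probs = seq(0, 1, 0.2), na.rm = TRUE),
      include.lowest = TRUE, labels = 1:5)
}

# qBr(): quintile break labels for the legend of a q5()-coloured map.
qBr <- function(df, variable, rnd = TRUE) {
  brks <- quantile(df[[variable]], probs = c(0.01, 0.2, 0.4, 0.6, 0.8), na.rm = TRUE)
  if (rnd) brks <- round(brks, 0)
  as.character(brks)
}

# palette5: any five-colour ramp works; mapTheme()/plotTheme() are assumed
# custom ggplot themes (theme_minimal() is a workable stand-in).
palette5 <- c("#f0f9e8", "#bae4bc", "#7bccc4", "#43a2ca", "#0868ac")
```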
ggplot() +
geom_sf(data = district, fill = "grey90", alpha = 0.5) +
geom_sf(data = nhoods, aes(colour = q5(price)),
show.legend = "point", size = .55) +
scale_colour_manual(values = palette5,
labels=qBr(nhoods,"price"),
name="Price") +
labs(title="Airbnb Listed Price",
subtitle = "Amsterdam, 2019") +
mapTheme()
district_data <- nhoods%>%
group_by(neighbourhood)%>%
summarise(average_price = mean(price))%>%
st_drop_geometry()%>%
left_join(.,district)%>%
st_as_sf()
ggplot() +
geom_sf(data = district_data, aes(fill = average_price), alpha = 0.8) +
scale_fill_viridis_c() +
geom_sf(data = nhoods, size = 0.1, color = "#FFD700", alpha = 0.1, show.legend = FALSE) +
labs(title = "Airbnb Average Listed Price by Neighbourhood" ,
subtitle = "Amsterdam, 2019") +
mapTheme()
This section turns to the specifics of individual listings and their housing attributes. Doing so reveals the unique features of each property, adding data that improves the accuracy of the model.
We mainly obtain these data through the listing_details files, which are all accessible through the Airbnb API.
# import listing details data
listing_details <- read.csv("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/listings_details.csv",
                            na.strings = c("", "NA")) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform('EPSG:7415') %>%
select(id, host_id, host_since, host_response_time, host_response_rate,
       host_listings_count, host_verifications, host_has_profile_pic,
       is_location_exact, property_type, room_type, accommodates, bathrooms,
       bedrooms, beds, bed_type, amenities, cleaning_fee, extra_people,
       minimum_nights, maximum_nights, cancellation_policy,
       review_scores_rating, reviews_per_month, price) %>%
filter(beds < 10)
#
# ==> 1. host_id:continuous numeric value
# ==> 3. minimum_nights:continuous numeric value
# ==> 4. maximum_nights:continuous numeric value
# ==> 5. accommodates:continuous numeric value
# ==> 7. host_has_profile_pic:2 categories
# ==> 8. is_location_exact:2 categories
# ==> 9. room_type:3 categories
# ==> 10. cancellation_policy:4 categories
# ==> 11. bed_type:5 categories
# ==> 12. property_type:31 categories
# date time calculation
listing_details$host_since <- as.Date(listing_details$host_since, format = "%Y-%m-%d")
target_date <- as.Date("2019-12-31")
listing_details$host_since_days <- as.numeric(target_date - listing_details$host_since)
#===> Part 2 | 2 columns need to change specific "N/A" value
listing_details<- listing_details%>% mutate(host_response_time = ifelse(host_response_time == 'N/A', 'unknown', host_response_time))
listing_details$price <- as.numeric(gsub("\\$", "", listing_details$price))
listing_details$host_response_rate <- as.numeric(gsub("%", "", listing_details$host_response_rate))
listing_details <- listing_details %>%
mutate(
host_response_rate = case_when(
host_response_rate %in% 0:1 ~ '~0%',
host_response_rate %in% 2:25 ~ '1-25%',
host_response_rate %in% 26:35 ~ '26-35%',
host_response_rate %in% 36:45 ~ '36-45%',
host_response_rate %in% 46:55 ~ '46-55%',
host_response_rate %in% 56:70 ~ '56-70%',
host_response_rate %in% 71:79 ~ '71-79%',
host_response_rate %in% 80:85 ~ '80-85%',
host_response_rate %in% 86:90 ~ '86-90%',
host_response_rate %in% 91:95 ~ '91-95%',
host_response_rate %in% 96:98 ~ '96-98%',
host_response_rate %in% 99:100 ~ '99-100%',
is.na(host_response_rate) ~ 'no data'))
# ==> Part 3 | 4 columns need to delete a few NA values
listing_details <- listing_details[!is.na(listing_details$host_since) &
!is.na(listing_details$bathrooms) &
!is.na(listing_details$bedrooms) &
!is.na(listing_details$beds), ]
#======> Part 4 | 3 columns need to delete nearly 2400 in total NA values
listing_details$cleaning_fee <-
as.numeric(gsub("\\$", "", listing_details$cleaning_fee))
listing_details$cleaning_fee[is.na(listing_details$cleaning_fee)] <- 0
listing_details$extra_people <-
as.numeric(gsub("\\$", "", listing_details$extra_people))
listing_details$extra_people[is.na(listing_details$extra_people)] <- 0
listing_details$reviews_per_month[is.na(listing_details$reviews_per_month)] <- 0
# ==> Part 5: array columns
listing_details$host_veri_length <-
sapply(listing_details$host_verifications,  # full column name, not host_verification
function(x) length(strsplit(gsub("\\[|\\]|'", "", x),
",\\s*")[[1]]))
listing_details$amenities_lengths <- sapply(listing_details$amenities, function(x) {
cleaned_content <- gsub("^\\{|\\}$", "", x)
elements <- strsplit(cleaned_content, ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", perl = TRUE)[[1]]
elements <- trimws(gsub("^\"|\"$", "", elements))
length(elements)
})
listing_details <- listing_details %>%
select(-amenities,-host_verifications,-host_since) %>%
filter(minimum_nights<32 & bathrooms < 11) %>%
na.omit()
column_names_list <- names(listing_details)
column_names_list <- paste(column_names_list, collapse = ",")
listing_details_numeric <- sapply(listing_details, is.numeric)
listing_details_numeric <- listing_details[,listing_details_numeric]%>% st_drop_geometry()
listing_details_categorical <- sapply(listing_details, function(x) is.factor(x) || is.character(x))
listing_details_categorical["price"] <- TRUE
listing_details_categorical<- listing_details[,listing_details_categorical]%>% st_drop_geometry()
#column_names_list
#id,host_id,host_response_time,host_response_rate,host_listings_count,host_has_profile_pic,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,cleaning_fee,extra_people,minimum_nights,maximum_nights,cancellation_policy,review_scores_rating,reviews_per_month,host_since_days,host_veri_length,amenities_lengths
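The `amenities` field arrives as a brace-wrapped, comma-separated string in which individual items may themselves contain commas inside double quotes. The quote-aware split used above can be sanity-checked in isolation; this is a standalone sketch of the same logic, run on a made-up example string.

```r
# Count amenities in one raw field value (same regex logic as in the chunk above).
count_amenities <- function(x) {
  cleaned <- gsub("^\\{|\\}$", "", x)  # strip the surrounding braces
  # split on commas that are NOT inside double quotes (lookahead requires an
  # even number of quotes between the comma and the end of the string)
  elements <- strsplit(cleaned, ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", perl = TRUE)[[1]]
  elements <- trimws(gsub("^\"|\"$", "", elements))
  length(elements)
}

count_amenities('{TV,"Smart TV, HDMI cable",Wifi}')  # 3: the quoted comma is not a separator
```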
We focus on understanding the correlations between the variables from the Airbnb listing details and price, as well as among the variables themselves.
To make the analysis easier, all the data were divided into categorical data and numeric data.
For the categorical data, we made plots showing both the price as a function of the categorical variables from the Airbnb details and the distribution of counts across those categories. From these we can see the relationships behind the categories, and we select `cancellation_policy`, `bed_type`, `property_type`, and `room_type` as variables for modeling.
st_drop_geometry(listing_details_categorical) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value)) +
geom_bar(fill = "#2A9D8F") +  # counts per category, matching the title
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Categorical variables amounts distribution from Airbnb details") +
plotTheme()
st_drop_geometry(listing_details_categorical) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(x = Value, y = price)) +
stat_summary(fun = mean, geom = "bar", fill = "#2A9D8F") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Price as a function of categorical variables from Airbnb details") +
plotTheme()
# select cancellation_policy,bed_type,property_type,room_type
For the numeric data, we further used `ggcorrplot` to analyse the relationships among them, and found that `accommodates`, `bathrooms`, `bedrooms`, `cleaning_fee`, `amenities_lengths`, `extra_people`, and `reviews_per_month` have a significant impact on price.
st_drop_geometry(listing_details_numeric) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) +
geom_smooth(data = . %>% filter(price >0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "Price as a function of continuous variables from Airbnb details") +
plotTheme()
ggcorrplot(
round(cor(listing_details_numeric%>%st_drop_geometry()%>%na.omit()), 1),
p.mat = cor_pmat(listing_details_numeric%>%st_drop_geometry()%>%na.omit()),
colors = c("#E63946", "white", "#2A9D8F"),
type="lower",
insig = "blank",
lab = TRUE) +
labs(title = "Correlation across listing details")
#select accommodates,bathrooms,bedrooms,cleaning_fee,amenities_lengths,extra_people,reviews_per_month
listing_details <- listing_details %>%
select(-price)
In this section, we delve into sentiment analysis of past review data for Airbnb listings. Sentiment analysis is a powerful tool that reveals more detail than the star rating alone. By analyzing reviews, we can fit mood patterns into future functions of the recommendation algorithm, and the results offer valuable insight into how guests perceive different aspects of their stay.
We conduct the analysis in Python using the NLTK and TextBlob packages, and bring the results back into R by exporting a CSV file. Polarity is the major factor carried into the model building.
sentiment_res <- nhoods %>%
st_drop_geometry() %>%
select(polarity_name, polarity_overview, polarity_rules, review_polarity, price) %>%
na.omit()
ggcorrplot(
round(cor(sentiment_res), 1),
p.mat = cor_pmat(sentiment_res),
colors = c("#E63946", "white", "#2A9D8F"),
type="lower",
insig = "blank",
lab = TRUE) +
labs(title = "Correlation across sentiment results")
sentiment_res %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) +
geom_smooth(data = . %>% filter(price >0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "Price as a function of Sentiment Result") +
plotTheme()
# select review_polarity, id
Thanks to the GIS portal of the City of Amsterdam, we have access to a number of useful datasets:
- `high-rise`: high-rise buildings in each tract
- `green-roof`: green roofs and their surface area in each tract
- `wall art`: number of wall-art pieces in each tract
- `market`: number of markets in each tract
- `swimming water`: number of swimming-water locations in each tract
- `tram_metro`: tram and metro stops across the whole city
- `flood`: flood area in each tract
- `parking`: parking pressure in each tract
# 1.Load high-rise data with 3 attributes: height, year, and geometry
highrise.sf <-
st_read("https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=HOOGBOUW_PUNT&THEMA=hoogbouw") %>%
dplyr::select(Hoogte, Jaar, geometry) %>%
rename(
height = Hoogte,
year = Jaar
)%>%
st_transform('EPSG:7415')
# 2.Load green roof data with 2 attributes
greenroof.sf <-
st_read("https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=GROENE_DAKEN&THEMA=groene_daken") %>%
dplyr::select(Oppervlakte_m2, geometry) %>%
rename(surface = Oppervlakte_m2)%>%
st_transform('EPSG:7415')
# 3.Load wall art data
wallart.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=WANDKUNST&THEMA=wandkunst')%>%
dplyr::select(geometry) %>%
st_transform('EPSG:7415')
# 4.Load market data
market.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=MARKTEN&THEMA=markten')%>%
dplyr::select(geometry) %>%
st_transform('EPSG:7415')
# 5.Load swimming water data
swimming.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=ZWEMWATER&THEMA=zwemwater')%>%
dplyr::select(geometry) %>%
st_transform('EPSG:7415')
# 6.Load tram and metro data
tram_metro.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=TRAMMETRO_PUNTEN_2019&THEMA=trammetro')%>%
dplyr::select( geometry) %>%
st_transform('EPSG:7415')
# 7.Load flood data
flood.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=KLIMAAT_OVERSTROMING&THEMA=klimaatadaptatie')%>%
dplyr::select(geometry) %>%
st_transform('EPSG:7415')
# 8.Load parking pressure data
parking.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=PARKEERDRUK_BUURTEN&THEMA=parkeerdruk')%>%
dplyr::select(geometry) %>%
st_transform('EPSG:7415')
# 9.Heating supply
heating.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=STADSWARMTEKOUDE_WIJK&THEMA=stadswarmtekoude')%>%
dplyr::select(geometry) %>%
st_transform('EPSG:7415')
# 10 parks
# https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=PARKPLANTSOENGROEN&THEMA=stadsparken
# 11 green project
# https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=GROENPROJECTEN&THEMA=groenprojecten
#12. compost
# https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=BUURTCOMPOST&THEMA=buurtcompost
Regional housing value and building age data can also be good indicators of Airbnb listing prices.
housing_value <-
st_read("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/housing_value_ams.geojson") %>%
st_transform('EPSG:7415') %>%
select(AvrVal)
housing_value$ID0 <- seq.int(nrow(housing_value))
housing_value_df <- housing_value %>% st_drop_geometry()
temp_nearest <- st_nearest_feature(nhoods, housing_value) %>%
as_data_frame() %>%
rename(ID0 = value) %>%
left_join(housing_value_df, by = 'ID0')
nhoods <- cbind(nhoods, temp_nearest)
#13. age
age.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=BOUWJAAR&THEMA=bouwjaar')%>%
st_transform('EPSG:7415')%>%
mutate(age = 2020-Bouwjaar)%>%
select(age)
age.sf$ID1 <- seq.int(nrow(age.sf))
age_df <- age.sf %>% st_drop_geometry()
temp_near_age <- st_nearest_feature(nhoods, age.sf) %>%
as_data_frame() %>%
rename(ID1 = value) %>%
left_join(age_df, by = 'ID1')
nhoods <- cbind(nhoods, temp_near_age)
nhoods%>%select(price,age,AvrVal)%>%st_drop_geometry()%>% na.omit()%>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) +
geom_smooth(data = . %>% filter(price >0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of Housing Property") +
plotTheme()
In this section, we examine the impact of proximity to key infrastructure on the desirability and pricing of Airbnb properties, using the variables imported above. For each listing, we calculated the average distance to its 1 to 5 nearest features of each type with `nn_function`, and performed a correlation analysis between the results and price.
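`nn_function` is used throughout this section but defined elsewhere. The following is a self-contained base-R sketch consistent with how it is used here, namely the average Euclidean distance from each point to its k nearest features; the actual helper may be a faster FNN-based version from the course materials.

```r
# Hypothetical stand-in for nn_function(): for each row of measureFrom,
# return the mean distance to its k nearest points in measureTo.
nn_function <- function(measureFrom, measureTo, k) {
  apply(measureFrom, 1, function(p) {
    d <- sqrt((measureTo[, 1] - p[1])^2 + (measureTo[, 2] - p[2])^2)
    mean(sort(d)[1:k])
  })
}

# Example: one listing at the origin, amenities at x = 1, 2, 3.
from <- matrix(c(0, 0), ncol = 2)
to   <- matrix(c(1, 0, 2, 0, 3, 0), ncol = 2, byrow = TRUE)
nn_function(from, to, 2)  # mean of the two closest distances: 1.5
```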
# 1.high-rise
nhoods$highrise.Buffer <- nhoods %>%
st_buffer(660) %>%
aggregate(mutate(highrise.sf, counter = 1),., sum) %>%
pull(counter)
nhoods <-
nhoods %>%
mutate(
highrise_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 1),
highrise_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 2),
highrise_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 3),
highrise_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 4),
highrise_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 5))
nhoods %>%
st_drop_geometry() %>%
dplyr::select(price, starts_with("highrise_")) %>%
filter(price <= 600) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) +
geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, nrow = 1, scales = "free") +
labs(title = "Price as a function of distance to high-rise") +
theme_light()
# 2.green roofs
nhoods$greenroof.Buffer <- nhoods %>%
st_buffer(660) %>%
aggregate(mutate(greenroof.sf, counter = 1),., sum) %>%
pull(counter)
nhoods <-
nhoods %>%
mutate(
greenroof_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 1),
greenroof_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 2),
greenroof_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 3),
greenroof_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 4),
greenroof_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 5))
nhoods %>%
st_drop_geometry() %>%
dplyr::select(price, starts_with("greenroof_")) %>%
filter(price <= 600) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) +
geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, nrow = 1, scales = "free") +
labs(title = "Price as a function of distance to green roofs") +
theme_light()
# 3.wall art
nhoods$wallart.Buffer <- nhoods %>%
st_buffer(660) %>%
aggregate(mutate(wallart.sf, counter = 1),., sum) %>%
pull(counter)
nhoods <-
nhoods %>%
mutate(
wallart_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 1),
wallart_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 2),
wallart_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 3),
wallart_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 4),
wallart_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 5))
nhoods %>%
st_drop_geometry() %>%
dplyr::select(price, starts_with("wallart_")) %>%
filter(price <= 600) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) +
geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, nrow = 1, scales = "free") +
labs(title = "Price as a function of distance to wall art") +
theme_light()
# 4.market data
nhoods$market.Buffer <- nhoods %>%
st_buffer(660) %>%
aggregate(mutate(market.sf, counter = 1),., sum) %>%
pull(counter)
nhoods <-
nhoods %>%
mutate(
market_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 1),
market_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 2),
market_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 3),
market_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 4),
market_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 5))
nhoods %>%
st_drop_geometry() %>%
dplyr::select(price, starts_with("market_")) %>%
filter(price <= 600) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) +
geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, nrow = 1, scales = "free") +
labs(title = "Price as a function of distance to markets") +
theme_light()
# 5.swimming pool data
nhoods$swimming.Buffer <- nhoods %>%
st_buffer(660) %>%
aggregate(mutate(swimming.sf, counter = 1),., sum) %>%
pull(counter)
nhoods <-
nhoods %>%
mutate(
swimming_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 1),
swimming_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 2),
swimming_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 3),
swimming_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 4),
swimming_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 5))
nhoods %>%
st_drop_geometry() %>%
dplyr::select(price, starts_with("swimming_")) %>%
filter(price <= 600) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) +
geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, nrow = 1, scales = "free") +
labs(title = "Price as a function of distance to swimming pools") +
theme_light()
#6 Tram & Metro
nhoods$tram.Buffer <- nhoods %>%
st_buffer(660) %>%
aggregate(mutate(tram_metro.sf, counter = 1),., sum) %>%
pull(counter)
nhoods <-
nhoods %>%
mutate(
tram_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 1),
tram_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 2),
tram_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 3),
tram_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 4),
tram_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 5))
nhoods %>%
st_drop_geometry() %>%
dplyr::select(price, starts_with("tram_")) %>%
filter(price <= 600) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) +
geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, nrow = 1, scales = "free") +
labs(title = "Price as a function of Distance to Tram & Metro Stations") +
theme_light()
Based on the above analysis, we continued with a correlation analysis to determine which variables are important factors affecting Airbnb prices. Among them, `greenroof_nn3`, `wallart_nn3`, `market_nn2`, `tram_nn3`, `age`, and `AvrVal` are the more significant.
ggcorrplot(
round(cor(nhoods%>%select(highrise_nn1,greenroof_nn3,wallart_nn3,market_nn2,swimming_nn3,tram_nn3,price,age,AvrVal)%>%st_drop_geometry()%>%na.omit()), 1),
p.mat = cor_pmat(nhoods%>%select(highrise_nn1,greenroof_nn3,wallart_nn3,market_nn2,swimming_nn3,tram_nn3,price,age,AvrVal)%>%st_drop_geometry()%>%na.omit()),
colors = c("#E63946", "white", "#2A9D8F"),
type="lower",
insig = "blank",
lab = TRUE) +
labs(title = "Correlation across distance factors")
#select greenroof_nn3,wallart_nn3,market_nn2,tram_nn3,age,AvrVal
In this section, we integrate census data to enrich our analysis of Airbnb listings. Census data provide valuable background about the communities where these properties are located, including demographic, economic, and social characteristics, and this additional layer allows a more complete understanding of the factors influencing the desirability and pricing of Airbnb properties. Our data come from the 2019 CBS census, which is similar in structure to the American Community Survey and provides critical support.
# Load acs data, need data selection here
ams_census <-
st_read("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/ams_census.geojson") %>%
st_transform('EPSG:7415') %>%
rename(residents = aantal_inwoners,
male_residents = aantal_mannen,
famale_residents = aantal_vrouwen,
pct_dutch_bkg = percentage_nederlandse_achtergrond,
pct_wes_bkg = percentage_westerse_migr_achtergr,
pct_nonwes_bkg = percentage_niet_westerse_migr_achtergr,
avr_fam_size = gemiddelde_huishoudensgrootte,
low_income_pct = percentage_laag_inkomen_huishouden,
high_income_pct = percentage_hoog_inkomen_huishouden,
median_HH = mediaan_inkomen_huishouden,
rental_home =percentage_huurwoningen,
# rental_corp = aantal_huurwoningen_in_bezit_woningcorporaties,
owned_pct = percentage_koopwoningen,
not_ocp_home =aantal_niet_bewoonde_woningen,
gas_using = gemiddeld_gasverbruik_woning,
# zip_house_value = gemiddelde_woz_waarde_woning,
elec_using = gemiddeld_elektriciteitsverbruik_woning,
dis_to_train = dichtstbijzijnde_treinstation_afstand_in_km,
dis_to_sub = dichtstbijzijnde_overstapstation_afstand_in_km,
fire_station = dichtstbijzijnde_brandweerkazerne_afstand_in_km,
daycare_num = kinderdagverblijf_aantal_binnen_3_km,
)%>%
dplyr::select(postcode4,daycare_num,fire_station,dis_to_sub,dis_to_train,elec_using,gas_using,not_ocp_home,owned_pct,rental_home,median_HH,high_income_pct,low_income_pct,avr_fam_size,pct_nonwes_bkg,pct_wes_bkg,pct_dutch_bkg,famale_residents,male_residents,residents)
# recode the ordinal income brackets (Dutch labels) as numbers
ams_census$median_HH_num <- dplyr::case_when(
  ams_census$median_HH == "00-20 laag"                    ~ 1,
  ams_census$median_HH == "20-40 onder midden"            ~ 2,
  ams_census$median_HH == "20-60 onder midden tot midden" ~ 3,
  ams_census$median_HH == "40-60 midden"                  ~ 4,
  ams_census$median_HH == "40-80 midden tot boven midden" ~ 5,
  ams_census$median_HH == "60-100 boven midden tot hoog"  ~ 6,
  ams_census$median_HH == "60-80 boven midden"            ~ 7,
  ams_census$median_HH == "80-100 hoog"                   ~ 8,
  TRUE ~ NA_real_)
nhoods <- st_join(nhoods,ams_census)%>% filter(postcode4 != 1101 & postcode4 != 1043 )
#daycare_num,fire_station,dis_to_sub,dis_to_train,elec_using,zip_house_value,gas_using,not_ocp_home,owned_pct,rental_corp,rental_home,median_HH,high_income_pct,low_income_pct,avr_fam_size,pct_nonwes_bkg,pct_wes_bkg,pct_dutch_bkg,famale_residents,male_residents,residents
We use `ggcorrplot` to perform correlation analysis on the results, and select `postcode4`, `daycare_num`, `gas_using`, `owned_pct`, `median_HH_num`, `pct_dutch_bkg`, `residents`, and `price` for modeling.
census_les <- nhoods %>%
st_drop_geometry() %>%
select(postcode4, daycare_num, fire_station, dis_to_sub, dis_to_train,
       elec_using, gas_using, not_ocp_home, owned_pct, rental_home,
       median_HH_num, high_income_pct, low_income_pct, avr_fam_size,
       pct_nonwes_bkg, pct_wes_bkg, pct_dutch_bkg, residents, price) %>%
na.omit()
census_les <- transform(census_les, postcode4 = as.numeric(postcode4)) %>%
filter(postcode4 != 1101 & postcode4 != 1043)
ggcorrplot(
round(cor(census_les), 1),
p.mat = cor_pmat(census_les),
colors = c("#E63946", "white", "#2A9D8F"),
type="lower",
insig = "blank",
#lab = TRUE
) +
labs(title = "Correlation across census data")
census_les %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) +
geom_smooth(data = . %>% filter(price >0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, ncol = 5, scales = "free") +
labs(title = "Price as a function of census values") +
plotTheme()
ggcorrplot(
round(cor(census_les%>%select(postcode4,daycare_num,dis_to_sub,elec_using,gas_using,owned_pct,median_HH_num,avr_fam_size,pct_wes_bkg,pct_dutch_bkg,residents,price)), 1),
p.mat = cor_pmat(census_les%>%select(postcode4,daycare_num,dis_to_sub,elec_using,gas_using,owned_pct,median_HH_num,avr_fam_size,pct_wes_bkg,pct_dutch_bkg,residents,price)),
colors = c("#E63946", "white", "#2A9D8F"),
type="lower",
insig = "blank",
lab = TRUE) +
labs(title = "Correlation across census values")
#select postcode4,daycare_num,dis_to_sub,elec_using,gas_using,owned_pct,median_HH_num,avr_fam_size,pct_wes_bkg,pct_dutch_bkg,residents,price
In this section, we enter the model building phase. Our goal is to develop a predictive model that can accurately predict Airbnb listing prices based on various factors identified during the data analysis stage. The models we will explore include baseline models, neighborhood models, lag models, and models that include neighborhood effects and price lag as well as sentiment factors.
In the data selection stage, we pick the most relevant and influential variables based on the correlation results above, laying the foundation for our modeling process. This step is crucial ("garbage in, garbage out"), because the accuracy and generalization of our predictive models depend heavily on the quality and appropriateness of the selected data.
listing_details <- listing_details %>%
st_drop_geometry() %>%
select(-room_type, -reviews_per_month, -minimum_nights)  # avoid duplicate columns after the join
nhoods_sub <- left_join(nhoods, listing_details, by = 'id') %>%
filter(price < 300) %>%
select(id, price, cancellation_policy, bed_type, property_type, room_type,
       accommodates, bathrooms, bedrooms, cleaning_fee, amenities_lengths,
       extra_people, reviews_per_month, review_polarity, review_scores_rating,
       greenroof_nn3, wallart_nn3, market_nn2, tram_nn3, age, AvrVal,
       postcode4, daycare_num, dis_to_sub, elec_using, gas_using, owned_pct,
       median_HH_num, avr_fam_size, pct_wes_bkg, pct_dutch_bkg, residents,
       neighbourhood) %>%
na.omit()
nhoods_sub <-transform(nhoods_sub, id = as.numeric(id))
#,postcode4 = as.numeric(postcode4)
numericVars <-
select_if(st_drop_geometry(nhoods_sub), is.numeric) %>% na.omit()
Next, we study neighborhood effects on Airbnb prices. Specifically, we employ Moran's I statistic to test for spatial autocorrelation in listing prices: the degree to which objects close to one another have similar values. The result shows that price is spatially autocorrelated, which indicates that we should account for spatial lag in the subsequent modeling process.
coords <- st_coordinates(nhoods_sub)
neighborList <- knn2nb(knearneigh(coords, 4))
spatialWeights <- nb2listw(neighborList, style="W")
nhoods_sub$lagPrice <- lag.listw(spatialWeights, nhoods_sub$price)
moranTest <- moran.mc(nhoods_sub$price,
spatialWeights, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#E63946",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in red",
x="Moran's I",
y="Count") +
plotTheme()
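The permutation logic behind `moran.mc` can be shown with a small language-agnostic sketch (Python here, on a made-up four-listing toy example rather than the Amsterdam data): compute Moran's I for the observed arrangement, recompute it for many random shuffles of the values over the locations, and see where the observed statistic falls.

```python
import random

def morans_i(values, weights):
    """Global Moran's I for a list of values and a dict {(i, j): w_ij}."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    den = sum(d * d for d in dev)
    return (n / sum(weights.values())) * (num / den)

# Toy data: four listings on a line, each weighted to its immediate neighbours.
prices = [100.0, 110.0, 200.0, 210.0]  # low prices cluster next to low prices
weights = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}

observed = morans_i(prices, weights)

# Permutation test: shuffle the values over the locations many times and
# compare the observed statistic against the permuted distribution.
rng = random.Random(825)
perm = []
for _ in range(999):
    shuffled = prices[:]
    rng.shuffle(shuffled)
    perm.append(morans_i(shuffled, weights))
p_value = (1 + sum(p >= observed for p in perm)) / (1 + len(perm))
```

A positive `observed` value with a small `p_value` is what the histogram above depicts: the red line (observed statistic) sits in the tail of the permuted values.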
To make the predictions more accurate, we tested combinations of different variables. Most of them had no significant impact on the results, but we kept two engineered features:
per_bathroom | a function of bathrooms and accommodates; this may represent the adequacy of facilities in a cheap listing.
fixed_score | uses lagPrice and the review quality of the listing to inform the price model. In a high-priced area, a listing with better reviews may actually be priced lower.
# additionally added review_scores_rating
nhoods_sub$per_bathroom <- nhoods_sub$bathrooms / nhoods_sub$accommodates
# current price value
nhoods_sub <- nhoods_sub %>%
  mutate(fixed_score = lagPrice * (0.1 + review_polarity) *
           (101 - review_scores_rating) * bedrooms * market_nn2)
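As a quick numeric illustration of that interaction (in Python, with entirely made-up inputs: lagPrice = 200, two bedrooms, market_nn2 = 0.5), the listing with the better review score contributes the smaller fixed_score, matching the intuition that well-reviewed listings in expensive areas can be relatively cheap:

```python
def fixed_score(lag_price, review_polarity, review_scores_rating,
                bedrooms, market_nn2):
    # Mirrors the mutate() above; all inputs here are hypothetical.
    return (lag_price * (0.1 + review_polarity)
            * (101 - review_scores_rating) * bedrooms * market_nn2)

# Two listings in the same high-priced area, identical except review score.
good_reviews = fixed_score(200, 0.8, 98, bedrooms=2, market_nn2=0.5)  # ~540
poor_reviews = fixed_score(200, 0.8, 80, bedrooms=2, market_nn2=0.5)  # ~3780
```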
We fitted five distinct linear models, each utilizing different aspects of our comprehensive dataset. The goal is to understand and accurately predict listing prices, considering a variety of factors. By comparing the performance of these models, we can identify the best option for our product. The models are:
Baseline Model | our foundation; it incorporates only fundamental listing features.
Neighborhoods Model | adds the impact of neighborhood characteristics on listing prices.
Zip Model | adds the impact of postal code areas.
Spatial Lag Model | adds how the prices of nearby listings (spatial lag) influence a particular listing's price.
Neighborhoods with LagPrice Model | combines the insights from neighborhood characteristics with spatial lag considerations.
data_to_train <- nhoods_sub %>%
  select(price, cancellation_policy, bed_type, room_type, accommodates,
         bathrooms, bedrooms, cleaning_fee, amenities_lengths, extra_people,
         reviews_per_month, review_polarity, greenroof_nn3, wallart_nn3,
         market_nn2, tram_nn3, age, AvrVal, postcode4, daycare_num, dis_to_sub,
         gas_using, owned_pct, median_HH_num, pct_wes_bkg, pct_dutch_bkg,
         residents, per_bathroom, fixed_score, neighbourhood, lagPrice)
set.seed(825)
inTrain <- createDataPartition(
y = paste(data_to_train$room_type),
p = .70, list = FALSE)
ams.training <- data_to_train[inTrain,]
ams.test <- data_to_train[-inTrain,]
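`createDataPartition` keeps the split stratified: it samples roughly 70% within each `room_type` level so that both sets preserve the class mix. A pure-Python sketch of that idea with made-up labels (`stratified_split` is a hypothetical helper, not a caret function):

```python
import random

def stratified_split(labels, p=0.70, seed=825):
    """Return (train_idx, test_idx), sampling ~p of each label group for training."""
    rng = random.Random(seed)
    by_label = {}
    for i, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(i)
    train = []
    for idxs in by_label.values():
        k = round(len(idxs) * p)       # per-group sample size
        train.extend(rng.sample(idxs, k))
    train = sorted(train)
    test = sorted(set(range(len(labels))) - set(train))
    return train, test

labels = ["Entire home"] * 70 + ["Private room"] * 25 + ["Shared room"] * 5
train_idx, test_idx = stratified_split(labels)
```

Because sampling happens per group, even the small "Shared room" class is represented in both partitions, which a plain random split would not guarantee.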
ams.training.baseline <- ams.training %>%
select(-postcode4,-neighbourhood,-lagPrice,-per_bathroom,-fixed_score)
ams.training.nhoods <- ams.training %>%
select(-postcode4,-lagPrice,-per_bathroom,-fixed_score)
ams.training.zip <- ams.training %>%
select(-neighbourhood,-lagPrice,-per_bathroom,-fixed_score)
ams.training.lag <- ams.training %>%
select(-postcode4,-neighbourhood)
ams.training.hoods_lag <- ams.training
reg.baseline <-
lm(price ~ ., data = st_drop_geometry(ams.training.baseline))
reg.nhoods <-
lm(price ~ ., data = st_drop_geometry(ams.training.nhoods))
reg.zip <-
lm(price ~ ., data = st_drop_geometry(ams.training.zip))
reg.lag <-
lm(price ~ ., data = st_drop_geometry(ams.training.lag))
reg.hoods_lag <-
lm(price ~ ., data = st_drop_geometry(ams.training.hoods_lag))
ams.test <-
ams.test %>%
na.omit()
ams.test.baseline <- ams.test %>%
mutate(Regression = "Baseline Regression",
listing_price.Predict = predict(reg.baseline, ams.test),
listing_price.Error = listing_price.Predict - price,
listing_price.AbsError = abs(listing_price.Predict - price),
listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict)
ams.test.nhoods <- ams.test %>%
mutate(Regression = "Neighborhoods Regression",
listing_price.Predict = predict(reg.nhoods, ams.test),
listing_price.Error = listing_price.Predict - price,
listing_price.AbsError = abs(listing_price.Predict - price),
listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict)
ams.test.zip <- ams.test %>%
mutate(Regression = "Zip Regression",
listing_price.Predict = predict(reg.zip, ams.test),
listing_price.Error = listing_price.Predict - price,
listing_price.AbsError = abs(listing_price.Predict - price),
listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict)
ams.test.lag <- ams.test %>%
mutate(Regression = "Spatial Lag Regression",
listing_price.Predict = predict(reg.lag, ams.test),
listing_price.Error = listing_price.Predict - price,
listing_price.AbsError = abs(listing_price.Predict - price),
listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict)
ams.test.hoods_lag <- ams.test %>%
mutate(Regression = "Neighborhoods with LagPrice Regression",
listing_price.Predict = predict(reg.hoods_lag, ams.test),
listing_price.Error = listing_price.Predict - price,
listing_price.AbsError = abs(listing_price.Predict - price),
listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict)
model_MAE_summaries <- bind_rows(
ams.test.baseline %>%
st_drop_geometry() %>%
summarise(MAE = mean(listing_price.AbsError),
MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Baseline Model"),
ams.test.nhoods %>%
st_drop_geometry() %>%
summarise(MAE = mean(listing_price.AbsError),
MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Neighborhoods Model"),
ams.test.zip %>%
st_drop_geometry() %>%
summarise(MAE = mean(listing_price.AbsError),
MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Zip Model"),
ams.test.lag %>%
st_drop_geometry() %>%
summarise(MAE = mean(listing_price.AbsError),
MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Spatial Lag Model"),
ams.test.hoods_lag %>%
st_drop_geometry() %>%
summarise(MAE = mean(listing_price.AbsError),
MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Neighborhoods with LagPrice Model")
)
kable(model_MAE_summaries, caption = 'Table 4.1 Summary of Model Performance') %>%
kable_styling("striped", full_width = F)
| MAE | MAPE | model |
|---|---|---|
| 29.60733 | 22.50012 | Baseline Model |
| 29.18136 | 22.06071 | Neighborhoods Model |
| 29.19889 | 22.01905 | Zip Model |
| 29.40483 | 22.36386 | Spatial Lag Model |
| 29.02514 | 21.91027 | Neighborhoods with LagPrice Model |
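The two metrics in the table can be stated compactly. Note one subtlety of the code above: the APE divides by the *predicted* price rather than the observed price, which differs from the conventional MAPE definition. A language-agnostic sketch in Python with made-up prices:

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.
    Matches the R code above: the denominator is the prediction."""
    return 100 * sum(abs(p - a) / p
                     for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical nightly prices in euros
actual    = [80.0, 120.0, 150.0, 200.0]
predicted = [90.0, 110.0, 160.0, 180.0]

print(mae(actual, predicted))   # 12.5
print(mape(actual, predicted))  # roughly 9.39
```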
We also tested the models on the test set; the results are shown below. The overall performance is reasonable, with roughly 22% MAPE and an R-squared of about 0.45. The differences among the models are small, which could indicate that the models capture only part of the neighborhood effect.
model_summaries <- bind_rows(
glance( reg.baseline) %>% mutate(model = "Baseline Model"),
glance(reg.nhoods) %>% mutate(model = "Neighborhoods Model"),
glance(reg.zip) %>% mutate(model = "Zip Model"),
glance(reg.lag) %>% mutate(model = "Spatial Lag Model"),
glance(reg.hoods_lag) %>% mutate(model = "Neighborhoods with LagPrice Model")
)
# Create the table
kable(model_summaries, caption = 'Table 4.2 Summary of Model Evaluation Parameters') %>%
kable_styling("striped", full_width = F)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs | model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4336975 | 0.4321463 | 39.39462 | 279.58127 | 0 | 31 | -57779.53 | 115625.1 | 115867.2 | 17563258 | 11317 | 11349 | Baseline Model |
| 0.4406986 | 0.4381239 | 39.18672 | 171.16585 | 0 | 52 | -57708.94 | 115525.9 | 115922.1 | 17346126 | 11296 | 11349 | Neighborhoods Model |
| 0.4453611 | 0.4407781 | 39.09406 | 97.17722 | 0 | 93 | -57661.44 | 115512.9 | 116209.9 | 17201525 | 11255 | 11349 | Zip Model |
| 0.4395207 | 0.4378364 | 39.19675 | 260.94982 | 0 | 34 | -57720.88 | 115513.8 | 115777.9 | 17382658 | 11314 | 11349 | Spatial Lag Model |
| 0.4518056 | 0.4462919 | 38.90085 | 81.94296 | 0 | 113 | -57595.12 | 115420.2 | 116264.0 | 17001655 | 11235 | 11349 | Neighborhoods with LagPrice Model |
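As a cross-check on Table 4.2, adjusted R-squared follows directly from `r.squared`, `nobs`, and `df.residual`; recomputing the Baseline row in Python reproduces the reported value:

```python
def adjusted_r2(r2, n_obs, df_residual):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / df_residual."""
    return 1 - (1 - r2) * (n_obs - 1) / df_residual

# Values taken from the Baseline Model row of Table 4.2
baseline_adj = adjusted_r2(0.4336975, n_obs=11349, df_residual=11317)
# matches the adj.r.squared reported in the table (0.4321463)
```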
plot_summs(reg.baseline, scale = TRUE)
To better examine the model and explore its real-world impact, we carried out further testing. The first step is to assess the performance of our most comprehensive model, the "Neighborhoods with LagPrice Model". A map of the absolute percentage error lets us observe how well the model's predictions align with actual prices across different areas of Amsterdam.
ggplot(ams.test.hoods_lag%>% filter(listing_price.APE<=1)) +
geom_sf(aes(colour = listing_price.APE)) +
scale_colour_viridis_c() +
geom_sf(data = district, fill = NA) +
labs(title = "MAPE Distribution",
subtitle = 'Amsterdam, 2019') +
mapTheme()
Then we created a scatter plot comparing actual listing prices against the predictions of the "Neighborhoods with LagPrice Model." The visualization shows that the original data is quite dispersed, so a sizable share of the absolute error reflects this dispersion rather than a failure to capture the overall trend.
ggplot(ams.test.hoods_lag, aes(x = price, y = listing_price.Predict)) +
geom_point(alpha = 0.2,color = "#2A9D8F") +
labs(title = "Predicted price vs real listing price",
subtitle = "Based on neighborhood-lagprice model",
x = "Actual listing",
y = "Predicted listings") +
geom_abline() +
geom_smooth(method = "lm", se = FALSE, color ='#E63946' ) +
theme_minimal()
We then evaluated the effectiveness of the spatial hood-lag model itself. To do this, we used Moran's I statistic again, this time to test for spatial autocorrelation in the model's prediction errors.
Spatial autocorrelation in this context would mean the errors cluster in space; if instead the errors are randomly distributed across space, the model has effectively captured the spatial structure.
coords <- st_coordinates(ams.test.hoods_lag)
neighborList <- knn2nb(knearneigh(coords, 4))
spatialWeights.nhoods <- nb2listw(neighborList, style="W")
moranTest.nhoods_lag <- moran.mc(ams.test.hoods_lag$listing_price.Error,
spatialWeights.nhoods, nsim = 999)
ggplot(as.data.frame(moranTest.nhoods_lag$res[c(1:999)]), aes(moranTest.nhoods_lag$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest.nhoods_lag$statistic), colour = '#2A9D8F',size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Moran's I for Prediction Result Errors",
subtitle= "Based on Prediction Error for Neighborhoods with LagPrice Model",
x="Moran's I",
y="Count") +
plotTheme()
In this section, we are consolidating the results from all the regression models to compare their performance across different metrics. This approach allows us to assess each model’s effectiveness in predicting Airbnb listing prices and to understand how they perform relative to each other.
Among them, the Neighborhoods with LagPrice Model, which combines neighborhood characteristics with spatial lag, shows the best performance of all the models. This underscores the importance of considering both local attributes and the influence of nearby listings.
AllRegressions <-
rbind(
dplyr::select(ams.test.baseline, starts_with("listing_"), Regression, neighbourhood, price) %>%
mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)),
dplyr::select(ams.test.nhoods, starts_with("listing_"), Regression, neighbourhood, price) %>%
mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)),
dplyr::select(ams.test.zip, starts_with("listing_"), Regression, neighbourhood, price) %>%
mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)),
dplyr::select(ams.test.lag, starts_with("listing_"), Regression, neighbourhood, price) %>%
mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)),
dplyr::select(ams.test.hoods_lag, starts_with("listing_"), Regression, neighbourhood, price) %>%
mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)))
AllRegressions <- na.omit(AllRegressions)
st_drop_geometry(AllRegressions) %>%
gather(Variable, Value, -Regression, -neighbourhood) %>%
filter(Variable == "listing_price.AbsError" | Variable == "listing_price.APE") %>%
group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value, na.rm = T)) %>%
spread(Variable, meanValue) %>%
kable(caption = 'Table 4.3 Summary of Model Errors')%>%
kable_styling("striped", full_width = F)
| Regression | listing_price.AbsError | listing_price.APE |
|---|---|---|
| Baseline Regression | 29.60733 | 0.2250012 |
| Neighborhoods Regression | 29.18136 | 0.2206071 |
| Neighborhoods with LagPrice Regression | 29.02514 | 0.2191027 |
| Spatial Lag Regression | 29.40483 | 0.2236386 |
| Zip Regression | 29.19889 | 0.2201905 |
Then we visualized the relationship between several independent variables and the absolute prediction errors of the "Neighborhoods with LagPrice" model. This clearly exposes a weakness of our model: for luxury houses and very large properties (especially those with many bathrooms and bedrooms), accuracy drops considerably.
ams.test.hoods_lag %>% st_drop_geometry() %>%
  select(accommodates, bathrooms, bedrooms, cleaning_fee, lagPrice,
         amenities_lengths, extra_people, reviews_per_month, review_polarity,
         greenroof_nn3, owned_pct, pct_dutch_bkg, listing_price.AbsError) %>%
  na.omit() %>%
gather(Variable, Value, -listing_price.AbsError) %>%
ggplot(aes(Value, listing_price.AbsError)) +
geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) +
geom_smooth( data = . %>% filter(listing_price.AbsError > 0), method = "lm", se=F, colour = "#005B96") +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "Price AbsError as a function of variables") +
plotTheme()
In this section, we visualize and compare the performance of the regression models developed above. The aim is to understand how well each model predicts Airbnb listing prices relative to the actual prices. As linear models, they cannot capture every detail of the data, but they provide meaningful predictive power, especially with a dataset as large and diverse as ours. Even though linear regression itself is simple, we achieve reasonable accuracy in predicting listing prices; with less dispersed raw data, the predictions would likely improve further.
AllRegressions%>% st_drop_geometry() %>%
dplyr::select(listing_price.Predict, price, Regression) %>%
ggplot(aes(price, listing_price.Predict)) +
geom_point(color = '#005B96',alpha = 0.1) +
stat_smooth(aes(price, price),
method = "lm", se = FALSE, size = 1, colour="#E63946") +
stat_smooth(aes(listing_price.Predict, price),
method = "lm", se = FALSE, size = 1, colour="#2A9D8F") +
facet_wrap(~Regression) +
labs(title="Predicted sale price as a function of observed price",
subtitle="Red line represents a perfect prediction; Green line represents prediction") +
plotTheme()
We then evaluated the robustness and reliability of our regression model through cross-validation, a statistical method used to estimate the skill of machine-learning models. The narrow distribution of fold errors, centered around an MAE of roughly 30, indicates reliable and consistent accuracy.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(price ~ ., data = st_drop_geometry(data_to_train),
method = "lm", trControl = fitControl, na.action = na.pass)
ggplot(data = reg.cv$resample) +
geom_histogram(aes(x = MAE), fill = '#2A9D8F') +
labs(title="Distribution of Cross-validation MAE",
subtitle = "K = 100\n",
caption = "Figure RESULT 4.2") +
xlab('MAE of Model') +
ylab('Count') +
plotTheme()
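The resampling that `trainControl(method = "cv", number = 100)` performs can be sketched language-agnostically (Python, with k = 5 instead of 100, a toy one-predictor least-squares model, and made-up data): split the observations into k folds, train on k - 1, and record the MAE on the held-out fold.

```python
import random

def fit_ols(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def cv_mae(xs, ys, k=5, seed=825):
    """k-fold cross-validation: fit on k-1 folds, score MAE on the held-out fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    maes = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        slope, intercept = fit_ols([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(intercept + slope * xs[i] - ys[i]) for i in fold]
        maes.append(sum(errors) / len(errors))
    return maes

# Made-up data: price is roughly 30 euros per bedroom plus noise.
random.seed(1)
bedrooms = [random.randint(1, 5) for _ in range(100)]
price = [30 * b + random.gauss(0, 10) for b in bedrooms]
fold_maes = cv_mae(bedrooms, price)
```

The spread of `fold_maes` plays the same role as the histogram above: a tight distribution across folds suggests the model's error is stable rather than driven by one lucky split.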
In this section, we visualize the results of our regression models by neighborhood. This approach allows us to understand how each model's performance varies across different areas of Amsterdam.
st_drop_geometry(AllRegressions) %>%
group_by(Regression, neighbourhood) %>%
summarize(mean.MAPE = mean(abs(listing_price.APE * 100), na.rm = T)) %>%
ungroup() %>%
left_join(district) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = mean.MAPE)) +
geom_sf(data = AllRegressions, colour = '#005B96', size = .1,alpha = 0.2) +
facet_wrap(~Regression) +
scale_fill_gradient(low = palette5[4], high = palette5[2],
name = "Mean Absolute Percent Error") +
labs(title = "Mean test set MAPE by neighborhood") +
mapTheme()
The following series of code snippets aims to understand how our model performs in different socio-economic contexts in Amsterdam. We use census data (ams_census) to classify neighborhoods according to racial background, family background, and income level, and then evaluate the model’s performance in these different contexts.
tracts17 <- ams_census %>%
select(median_HH, pct_dutch_bkg, avr_fam_size)%>%
mutate(raceContext = ifelse(pct_dutch_bkg > 50, "Majority Dutch", "International Zone"),
famContext = ifelse(avr_fam_size > 2, "Large Fams", "Small Fams"))
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(tracts17), aes(fill = raceContext)) +
scale_fill_manual(values = c("#2A9D8F", "#E63946"), name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(tracts17), aes(fill = famContext)) +
scale_fill_manual(values = c("#2A9D8F", "#E63946"), name="Family Context") +
labs(title = "Family Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(tracts17), aes(fill = median_HH)) +
scale_fill_manual(values = c("#2a9d8f", "#458f85", "#60807a", "#7b7270", "#956465", "#b0565b", "#cb4750", "#e63946"), name="Income Context") +
labs(title = "Income Context") +
mapTheme() + theme(legend.position="bottom"))
This code joins the regression results with the racial-background data and summarizes each model's MAPE in each context. The results show that our model performs consistently across areas with different ethnic compositions, exhibiting good generality.
st_join(AllRegressions, tracts17) %>%
group_by(Regression, raceContext) %>%
summarize(mean.MAPE = scales::percent(mean(listing_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(raceContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood racial context")
| Regression | International Zone | Majority Dutch |
|---|---|---|
| Baseline Regression | 23% | 22% |
| Neighborhoods Regression | 22% | 22% |
| Neighborhoods with LagPrice Regression | 22% | 21% |
| Spatial Lag Regression | 23% | 22% |
| Zip Regression | 22% | 22% |
This code joins the regression results with household-size data and summarizes each model's MAPE in each context. The results show that our model performs better in areas dominated by smaller households, which may correspond to urban centers or areas where young people gather.
st_join(AllRegressions, tracts17) %>%
filter(!is.na(famContext)) %>%
group_by(Regression, famContext) %>%
summarize(mean.MAPE = scales::percent(mean(listing_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(famContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood Family context")
| Regression | Large Fams | Small Fams |
|---|---|---|
| Baseline Regression | 29% | 22% |
| Neighborhoods Regression | 27% | 22% |
| Neighborhoods with LagPrice Regression | 26% | 22% |
| Spatial Lag Regression | 28% | 22% |
| Zip Regression | 26% | 22% |
This code joins the regression results with the income data and summarizes each model's MAPE across income levels. The results show that our model performs relatively stably across income regions.
st_join(AllRegressions, tracts17) %>%
filter(!is.na(median_HH)) %>%
group_by(Regression, median_HH) %>%
summarize(mean.MAPE = scales::percent(mean(listing_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(median_HH, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood income context")
| Regression | 20-40 onder midden | 20-60 onder midden tot midden | 40-60 midden | 40-80 midden tot boven midden | 60-100 boven midden tot hoog | 60-80 boven midden | 80-100 hoog |
|---|---|---|---|---|---|---|---|
| Baseline Regression | 26% | 22% | 22% | 25% | 21% | 23% | 24% |
| Neighborhoods Regression | 24% | 22% | 21% | 22% | 22% | 23% | 24% |
| Neighborhoods with LagPrice Regression | 24% | 22% | 21% | 31% | 26% | 23% | 24% |
| Spatial Lag Regression | 26% | 22% | 22% | 25% | 20% | 23% | 25% |
| Zip Regression | 24% | 22% | 21% | 31% | 26% | 23% | 24% |
We analyzed Airbnb listings in Amsterdam using a variety of datasets and methodologies and derived several important insights. We successfully developed and evaluated five linear models: the Baseline Model, Neighborhoods Model, Zip Model, Spatial Lag Model, and Neighborhoods with LagPrice Model, providing a broad spectrum for comparing and understanding listing prices. Each model offers a unique perspective on predicting Airbnb listing prices. Despite the simplicity of linear models, our method shows considerable predictive accuracy, as evidenced by the cross-validation results and the MAE distribution. Still, linear models may not fully capture complex nonlinear relationships in the data; this is especially true for luxury listings and properties with many rooms. At the same time, the dispersion of the raw data creates challenges, especially in capturing trends and making accurate predictions for market outliers. Our approach combines traditional listing characteristics with spatial and socioeconomic factors, increasing the depth and accuracy of the analysis.
As a future direction, we plan to integrate this model into our plug-in. These insights can enhance the Airbnb user experience and, if combined with real-time data, will improve the applicability and accuracy of the model. However, we also need to test in cities other than Amsterdam to determine whether our model generalizes well. Other European cities will challenge the model in terms of spatial pattern, distribution, design, and historical and cultural influence, but we are confident it will at least perform well in cities similar to Amsterdam.
References:
“R Document.” RDocumentation, www.rdocumentation.org/. Accessed 20 Dec. 2023.
Steif, Ken. Public Policy Analytics: Code and Context for Data Science in Government. CRC Press, Taylor & Francis Group, 2022.
Luo, Yuanhang, Xuanyu Zhou, and Yulian Zhou. “Predicting airbnb listing price across different cities.” (2019).
Rezazadeh Kalehbasti, Pouya, Liubov Nikolenko, and Hoormazd Rezaei. “Airbnb price prediction using machine learning and sentiment analysis.” International Cross-Domain Conference for Machine Learning and Knowledge Extraction. Cham: Springer International Publishing, 2021.